Robots.txt Disallow All: Blocking AI Bots is as misguided as blocking Google in the 90s!

Published on: Apr 23, 2025

My GEO journey began when Copilot critiqued my startup; I chose to learn, not ignore it. That curiosity led to media features and being named the #1 GEO Consultant by YesUsers.

Avinash Tripathi

Introduction: Lessons from the Past

In the late 90s, many media companies decided to block search engine bots, including Google, from crawling their websites by using a disallow-all rule in their robots.txt file. They felt that search engines were unfairly exploiting their content. But boy, oh boy, was that a mistake for their web traffic. Over time, they came to realize that collaboration, not exclusion, drove visibility, traffic, and revenue.

Mark Twain once said, “History Doesn’t Repeat Itself, but It Often Rhymes.” In line with this statement, businesses today grapple with a similar dilemma about AI and LLM crawlers such as GPTBot and PerplexityBot.

The AI Crawler Dilemma: Visibility vs. Protection

There is growing anxiety among content creators and businesses about how proprietary data might be utilized to train these models.

The concerns revolve around misuse and the potential for distorting their intellectual property. Despite these valid worries, it’s essential to consider the implications of shutting out AI crawlers entirely.

In today’s AI-driven world, a wholesale ban on AI bots such as GPTBot and PerplexityBot may well keep your content out of large language model (LLM) training data, but it will also make your brand, company, and offerings invisible to those same LLMs.

My perspective on this issue is to strike a balance. I advise allowing access to these bots while denying access to your copyrighted and subscription-based content. Implemented well, this approach lets you safeguard your interests while boosting your brand’s online presence and user engagement.

What is Robots.txt? A Modern Guide

A robots.txt file is kind of like a concert pass, telling web crawlers who can get in and where they’re allowed to go. Just as only those with a backstage pass can access restricted areas at a concert, robots.txt lets you specify which parts of your website search engine bots can visit and which areas are off-limits.

This file helps manage how search engines crawl and index your site, preventing them from accessing sensitive or unnecessary pages and helping to reduce server load. However, it’s important to remember that not all bots respect these rules.

How It Works:

It uses simple rules to instruct crawlers, such as Disallow (to prevent crawling specific URLs) and Allow (to allow crawling specific URLs).
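
To see how these rules behave in practice, here’s a minimal Python sketch using the standard library’s urllib.robotparser; the rules and URLs are made up for illustration.

# A minimal sketch of how Disallow/Allow rules are evaluated, using Python's
# built-in robots.txt parser. The rules and URLs below are illustrative only.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /private/
Allow: /blog/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # feed the rules in as a list of lines

# can_fetch(user_agent, url) answers: "may this bot crawl this URL?"
print(parser.can_fetch("GPTBot", "https://example.com/private/report.pdf"))  # False
print(parser.can_fetch("GPTBot", "https://example.com/blog/hello-world/"))   # True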

Not a guarantee:

It’s important to remember that a robots.txt file doesn’t guarantee that a page won’t be indexed. It’s a suggestion to crawlers, and some may ignore your robots.txt entirely. For this reason, robots.txt shouldn’t be relied on as a security measure since determined or malicious crawlers can easily bypass it.

Purpose of Robots.txt:

Website owners use robots.txt to:

  • Manage crawler traffic and prevent server overload.
  • Block specific directories or files from being crawled.
  • Guide crawlers to important pages for indexing.

How to Find or Upload Your Robots.txt file

You can usually find a website’s robots.txt file by adding /robots.txt to the end of the website’s URL (e.g., example.com/robots.txt). If there is none, your robots.txt file should be placed in the root directory of the website.
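
If you’d rather check from a script than a browser, here’s a small Python sketch (standard library only) that fetches a site’s robots.txt; example.com is just a placeholder.

# Fetch a site's robots.txt to see whether it exists, using only the standard
# library. Returns None on a 404, which crawlers treat as "everything allowed".
import urllib.error
import urllib.request

def fetch_robots_txt(site):
    try:
        with urllib.request.urlopen(site.rstrip("/") + "/robots.txt", timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no robots.txt file at the root
        raise

print(fetch_robots_txt("https://example.com"))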

Pro Tip: If You're Using CMS

If you’re using a Content Management System (CMS) such as WordPress, Wix, or Blogger, you usually don’t need to create or edit the robots.txt file by hand. You might also be running a plugin like Yoast or AI Monitor WP on top of your CMS.

In that case, a search settings page (or something similar) lets you manage whether search engines can crawl your pages.

If you want to keep a page hidden from search engines or make it visible again, check out how to adjust your page’s visibility in your CMS. For example, search “Yoast hide page from search engines” to find what you need.

The Ideal Robots.txt File in 2025

Here’s what makes a robots.txt file ideal in today’s AI-dominated information discovery process:

1. User-agent Directive

The User-agent directive is crucial—it specifies which crawlers (also known as bots) the rules apply to.

A common mistake is mentioning only Googlebot. Instead, it’s ideal to use User-agent: *, which applies the rules universally to all crawlers. This ensures your directives aren’t limited to just one search engine but are inclusive and applicable to the broader bot community.

Example:

User-agent: *

Why does this matter?

Not all web traffic comes from Google—so universal bot coverage maximizes your site’s reach while managing crawler activity effectively.

2. Allow and Disallow Directive

The Allow and Disallow directives are the backbone of your robots.txt file, dictating which parts of your site are accessible to crawlers and which are restricted. Used strategically, they balance visibility with protection. Here’s how to wield them effectively:

User-agent: *
Disallow: /private

Translation: “All bots: Stay out of my private folder!”

User-agent: *
Disallow: /secret-lab/
Allow: /public-cat-videos/

Translation: “All bots: stay out of my secret lab (no one needs to see my failed robot uprising blueprints), but feel free to binge my cat videos!”

User-agent: *
Disallow: /
Allow: /blog/

Translation: “All bots: Block my entire site except the /blog/ directory.”

Granular Control for AI Crawlers

To future-proof for AI, apply rules specifically for LLM bots like GPTBot or PerplexityBot:

Example:

User-agent: GPTBot
Disallow: /ChatGPT-clone/
Allow: /blog/

User-agent: *
Disallow: /user-dashboards/

Translation: Blocks AI crawlers from the ChatGPT clone you are working on in your free time. However, this allows them to index public content (e.g., blogs) for visibility in AI tools.
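
To sanity-check granular rules like these, here’s a small Python sketch (again using urllib.robotparser) that asks the same question from GPTBot’s point of view and from a generic crawler’s; the paths mirror the example above and are illustrative only.

# Check the granular rules from two perspectives: GPTBot (its own group)
# and any other crawler (covered by the * group).
from urllib.robotparser import RobotFileParser

rules = """
User-agent: GPTBot
Disallow: /ChatGPT-clone/
Allow: /blog/

User-agent: *
Disallow: /user-dashboards/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("GPTBot", "/ChatGPT-clone/index.html"))    # False: blocked for GPTBot
print(parser.can_fetch("GPTBot", "/blog/my-post/"))               # True: public content stays visible
print(parser.can_fetch("SomeOtherBot", "/user-dashboards/home"))  # False: the * rule applies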

Things to Avoid in Robots.txt

Conflicting Rules:

Disallow: /blog/
Allow: /blog/latest-news/

Outcome: Some crawlers (like Google) will allow /blog/latest-news/, while others may ignore the Allow directive.
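
A quick sketch makes the risk concrete: Python’s built-in parser applies the first matching rule, while Google resolves conflicts using the most specific (longest) path, so the same file can be read two different ways.

# Demonstrate how conflicting rules diverge across parsers. Google would allow
# /blog/latest-news/ (the longer Allow rule wins there), but urllib.robotparser
# reports it as blocked because Disallow: /blog/ matches first.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /blog/
Allow: /blog/latest-news/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("SomeBot", "/blog/latest-news/update"))  # False here, allowed by Google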

Overly Broad Blocks:

Disallow: /

Outcome: This blocks your entire site. Use it only if you want zero visibility.

As mentioned earlier, this is not a watertight method to ensure compliance. You must protect genuinely sensitive content with stronger measures, such as authentication, paywalls, or legal terms of service.

3. Crawl Delay

An ideal crawl delay in robots.txt generally ranges from 1 to 10 seconds, with 10 seconds being the most common suggestion. This delay, specified with the Crawl-delay: directive, tells crawlers how long to wait between requests to your website. Keep in mind that support varies: some major crawlers, Googlebot among them, ignore Crawl-delay entirely.

User-agent: *  
Disallow: /proprietary-data/  
Allow: /  
Crawl-delay: 10

Translation: Don’t come back knocking before the 10 seconds have passed.
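
Here’s a small sketch of how a well-behaved crawler could honor that delay, using urllib.robotparser’s crawl_delay() helper; real-world support for Crawl-delay varies by bot, and the bot name and URLs below are placeholders.

# Read the Crawl-delay for our (hypothetical) bot and wait that long between
# requests that robots.txt allows.
import time
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /proprietary-data/
Allow: /
Crawl-delay: 10
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

delay = parser.crawl_delay("MyPoliteBot") or 0  # None if no Crawl-delay is set
urls = ["/blog/post-1/", "/blog/post-2/"]       # placeholder URLs

for url in urls:
    if parser.can_fetch("MyPoliteBot", url):
        print(f"fetching {url}, then waiting {delay}s")
        time.sleep(delay)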

4. Sitemap Directive

The Sitemap directive is a guiding star for crawlers. It tells them where to find the sitemap file—a comprehensive list of your site’s URLs. This makes it easier for bots to understand your site’s structure and index it efficiently.

Example:

Sitemap: https://www.example.com/sitemap.xml

Why does this matter?

A well-placed Sitemap directive ensures search engines have all the vital info they need to index your site properly, boosting your visibility.
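
If you want to confirm the directive is being picked up, here’s a short sketch using urllib.robotparser’s site_maps() helper (available from Python 3.8); the URL is a placeholder.

# Read the Sitemap entries declared in a live robots.txt file.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # fetch and parse the live file

sitemaps = parser.site_maps()  # list of Sitemap URLs, or None if none are declared
if sitemaps:
    for url in sitemaps:
        print("Sitemap found:", url)
else:
    print("No Sitemap directive declared.")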

Update this file and add new rules as your site evolves. That “/ai-pet-rock-store/” directory? Yeah, block it now.

Robots.txt Example for 2025: Future-proofed, courtesy of AI Monitor

User-agent: *  
Disallow: /secret-lab/  
Disallow: /proprietary-data/  
Allow: /  
Crawl-delay: 10 
Sitemap: https://yoursite.com/sitemap.xml

Why Blocking AI Crawlers Is a Strategic Mistake

The Precedent of Search Engines:

Brands that embraced SEO thrived; those that resisted faded into obscurity. Similarly, LLMs will shape future discovery.


Case Study:

One of our clients saw a 46% traffic drop after blocking AI bots, while a competitor that allowed them gained featured snippets in AI tools.

Checklist for Website Owners and Content Creators

  • ☑️ Audit Your robots.txt: Ensure it’s not disallowing AI crawlers (e.g., GPTBot).
  • ☑️ Segment Access: Use granular rules to protect paid content or confidential data.
  • ☑️ Monitor Compliance: We have a free tool called AI Bot Monitor that you can use to track bot activity (a simple log-scanning sketch follows this checklist).
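
Here’s a minimal sketch of that monitoring step: scan your web server’s access log for known AI crawler user agents. The log path, log format, and the exact list of bot names are assumptions, so adjust them for your own setup.

# Count requests from AI crawlers by scanning an access log for their
# user-agent strings. Path and bot list are placeholders.
from collections import Counter

AI_BOTS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]

hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        for bot in AI_BOTS:
            if bot.lower() in line.lower():
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")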

Conclusion: Adapt or Be Invisible

In my personal opinion, blocking AI crawlers today is as myopic as blocking Google in the 90s. The key lies in strategic access, shielding critical data while ensuring your brand remains part of the AI-driven conversation. Update your robots.txt, embrace transparency, and position your content for the future.

Your next step? Review your robots.txt at yoursite.com/robots.txt—before AI overlooks your business entirely.

Frequently asked questions!

  • Why is blocking all AI bots like GPTBot a bad idea?

    Blocking all AI bots is a lot like the old practice of blocking Google’s crawlers: it removes your business from AI-powered discovery. Keeping your content out of LLM training also makes your brand invisible in AI-generated answers, which means fewer engagement opportunities and less traffic.

  • Is robots.txt a foolproof way to stop AI bots from scraping my site?

    No. Robots.txt is a suggestion, not a security measure. Ethical bots (like Googlebot or GPTBot) respect it, but malicious scrapers may ignore it. For sensitive data, use stronger protections like authentication, paywalls, or legal measures (e.g., terms of service).

  • How do I check if my site is blocking AI crawlers?

    To check, visit yoursite.com/robots.txt and look for a User-agent: GPTBot group or broad Disallow: / rules. A Disallow: / rule applied to all bots will hide your site from search engines and AI tools alike.

  • What’s the ideal crawl delay to prevent server overload?

    A 10-second delay (Crawl-delay: 10) is a good balance—it reduces server strain while letting bots index your content efficiently. Adjust based on your site’s traffic and hosting capacity.

  • Will AI bots index my site even if I don’t mention them in robots.txt?

    Yes! Most AI crawlers, including GPTBot, follow the rules listed under User-agent: *. If you don’t block them explicitly in your robots.txt file, they will crawl your site just like any other bot. To block GPTBot specifically, add a User-agent: GPTBot group with Disallow: /.

  • How can I future-proof my robots.txt for AI?

    Use User-agent: * in your robots.txt file to allow broad bot access. Make sure public content like /blog/ is accessible to AI crawlers, while sensitive directories such as /private/ are blocked. Include a sitemap to help bots navigate your site efficiently. To stay informed, monitor bot activity using tools like AI Bot Monitor—this helps you track engagement, spot anomalies, and optimize your crawl strategy over time.